Preparation

First things first... Let's read the CSV files and take a look at what's inside them
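Loading the two files could look like the sketch below. The file names and column names are assumptions (the originals aren't shown here), so tiny inline samples stand in for the real CSVs to keep the snippet self-contained:

```python
import io
import pandas as pd

# In the real notebook this would be e.g.:
#   daily_device_counts = pd.read_csv("daily_device_counts.csv")
#   fleet_metadata = pd.read_csv("fleet_metadata.csv")
# (file names assumed). Inline samples used here so the snippet runs anywhere.
daily_device_counts = pd.read_csv(io.StringIO(
    "fleet_id,date,count\n"
    + "a" * 32 + ",2019-01-01,12\n"
    + "b" * 32 + ",2019-01-01,7\n"
))
fleet_metadata = pd.read_csv(io.StringIO(
    "uuid,device_type,industry\n"
    + "a" * 32 + ",Raspberry Pi,Telecommunications\n"
    + "b" * 32 + ",Raspberry Pi,Telecoms\n"
))

print(daily_device_counts.head())
print(fleet_metadata.head())
```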

daily_device_counts

No surprises here... As the description provided:

For someone not very familiar with the dataset (like me at this point), making the column names self-explanatory helps later on in the analysis (especially when multiple data sets are involved). So I'll rename some of the variables so it's obvious what they represent.
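A rename along these lines does the trick. The original column names here are hypothetical, the point is just making each column's role explicit:

```python
import pandas as pd

# Hypothetical original column names ("id", "count"); the renames make each
# column's role obvious once several data sets are joined.
daily_device_counts = pd.DataFrame(
    {"id": ["f1"], "date": ["2019-01-01"], "count": [10]}
)
daily_device_counts = daily_device_counts.rename(
    columns={"id": "fleet_id", "count": "device_count"}
)
print(list(daily_device_counts.columns))  # ['fleet_id', 'date', 'device_count']
```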

Apparently no NULL values... Let's double-check and make sure there are no strings that actually represent NULLs.
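One way to do that check: look at genuine NaNs first, then scan for string placeholders. The list of null-like strings below is an assumption about what might show up:

```python
import pandas as pd

df = pd.DataFrame({
    "fleet_id": ["f1", "f2", "f3"],
    "date": ["2019-01-01", "NA", "2019-01-03"],
})

# Genuine NaNs first:
print(df.isnull().sum())

# Then strings that commonly stand in for NULLs (this list is an assumption):
null_like = {"", "na", "n/a", "null", "none", "nan", "-"}
hidden_nulls = {
    col: df[col].astype(str).str.strip().str.lower().isin(null_like).sum()
    for col in df.columns
}
print(hidden_nulls)  # the "NA" string in date is caught here
```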

All IDs have 32 characters, which is a good indication that the IDs are fine. Let's now try converting date to a datetime and see whether that raises any issues.
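Both checks can be sketched as follows (toy data; `errors="coerce"` turns any unparseable date into NaT so it surfaces in a NULL check instead of raising):

```python
import pandas as pd

df = pd.DataFrame({
    "fleet_id": ["a" * 32, "b" * 32],
    "date": ["2019-01-01", "2019-01-02"],
})

# Every ID should be exactly 32 characters long:
print(df["fleet_id"].str.len().value_counts())

# Convert to datetime; bad values become NaT rather than raising:
df["date"] = pd.to_datetime(df["date"], errors="coerce")
print(df["date"].isnull().sum())
```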

Perfect... We converted date to a datetime type and still no NULLs. Let's also make sure each combination of fleet_id + date appears only once. As you can see, no problems there:
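The uniqueness check is a one-liner with `duplicated` on the pair of key columns (toy data below):

```python
import pandas as pd

df = pd.DataFrame({
    "fleet_id": ["f1", "f1", "f2"],
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-01"]),
    "device_count": [10, 11, 5],
})

# Count rows whose (fleet_id, date) pair has already appeared:
dupes = df.duplicated(subset=["fleet_id", "date"]).sum()
print(dupes)  # 0
```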

fleet_metadata

Again, no surprises here. As the description provided:

I'll rename the uuid to fleet_id for consistency and check for NULLs.
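Both steps in one sketch (the other metadata columns are illustrative):

```python
import pandas as pd

fleet_metadata = pd.DataFrame({
    "uuid": ["f1", "f2"],
    "device_type": ["Raspberry Pi", "Other"],
    "industry": ["Telecommunications", "Retail"],
})

# Match the naming used in daily_device_counts, then check for NULLs:
fleet_metadata = fleet_metadata.rename(columns={"uuid": "fleet_id"})
print(fleet_metadata["fleet_id"].isnull().sum())  # 0
```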

All good with the fleet_id. Let's take a look at device_type and industry:

Perfect! Besides validating the data set for NULLs, we can already notice that most fleets use Raspberry Pi, and that some industries could perhaps be merged together (e.g. Telecommunications and Telecoms).

I'll quickly make that adjustment, but ideally this kind of data inaccuracy would be handled at the source if possible. So in a real-life scenario, I'd investigate how the data was generated and evaluate whether any process/system improvement could be made.
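The fix itself is a simple value replacement. Which label is canonical is an assumption here; I picked Telecommunications:

```python
import pandas as pd

fleet_metadata = pd.DataFrame({
    "fleet_id": ["f1", "f2", "f3"],
    "industry": ["Telecommunications", "Telecoms", "Retail"],
})

# Map the near-duplicate label onto one canonical name (canonical choice
# is an assumption; ideally this is fixed upstream at the source).
fleet_metadata["industry"] = fleet_metadata["industry"].replace(
    {"Telecoms": "Telecommunications"}
)
print(fleet_metadata["industry"].value_counts())
```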

One final check is to make sure each fleet_id appears only once. And again, no issues as you can see:

Since we have no duplicates and each fleet_id is unique in fleet_metadata, I'll merge both data sets:
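A left join on fleet_id does this safely: because fleet_id is unique in the metadata, the row count of the daily table is preserved. Sketch with toy frames:

```python
import pandas as pd

daily = pd.DataFrame({
    "fleet_id": ["f1", "f1", "f2"],
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-01"]),
    "device_count": [10, 11, 5],
})
meta = pd.DataFrame({
    "fleet_id": ["f1", "f2"],
    "device_type": ["Raspberry Pi", "Other"],
    "industry": ["Telecommunications", "Retail"],
})

# Left join: metadata keys are unique, so no rows are duplicated or lost.
merged = daily.merge(meta, on="fleet_id", how="left")
print(merged.head())
```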

Descriptive Analysis

The first question I have on my mind is regarding the size of the fleets and how that's evolving over time.

Clearly, there are some fleets that are outliers and quite big compared to the other ones. Let's investigate that a bit more.

Now that we removed some outliers, we can see things like:

Here I isolated the outliers; once again, investigating the reason for these spikes can generate valuable insights. I'd also be curious to identify the fleet that has over 15k devices:
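Finding that fleet is a filter on the most recent date. The 15k threshold comes from the plot; the fleet IDs below are made up:

```python
import pandas as pd

merged = pd.DataFrame({
    "fleet_id": ["big", "small"],
    "date": pd.to_datetime(["2019-06-01", "2019-06-01"]),
    "device_count": [15500, 40],
})

# Restrict to the latest snapshot, then keep fleets above the threshold
# spotted in the plot (15k devices):
latest = merged[merged["date"] == merged["date"].max()]
big_fleets = latest[latest["device_count"] > 15000]
print(big_fleets)
```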

It's already obvious how the distribution of the number of devices is skewed, and there are two things I still want to take a look at:

Notice how the Pareto principle applies here. Looking at the most recent date, roughly 77% of all devices come from only 10% of the fleets (9 fleets). Those deserve special attention, since they probably generate most of the company's revenue and are the system's main users.
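The concentration figure can be computed by ranking fleets by size and taking the cumulative share. The toy counts below are illustrative; the 77%/10% numbers in the text come from the real data, not from this sketch:

```python
import pandas as pd

# Toy per-fleet counts for the most recent date (illustrative only):
latest = pd.DataFrame({
    "fleet_id": [f"f{i}" for i in range(10)],
    "device_count": [500, 300, 50, 40, 30, 25, 20, 15, 12, 8],
})

# Rank fleets by size and compute the cumulative share of all devices:
ranked = latest.sort_values("device_count", ascending=False)
cum_share = ranked["device_count"].cumsum() / ranked["device_count"].sum()
top_10pct = max(1, int(0.1 * len(ranked)))
print(f"top {top_10pct} fleet(s) hold {cum_share.iloc[top_10pct - 1]:.0%} of devices")
```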

Predictive Analysis

Based on the information we have, let's try to forecast the number of devices we can expect to observe in the future. I'll plot the total number of devices over time again; notice how it looks like we can fit a curve using a polynomial regression:
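A degree-3 fit with `numpy.polyfit` is one way to do it. The series below is synthetic (the real notebook fits the actual daily totals), so treat this as a shape of the approach, not the result:

```python
import numpy as np

# Synthetic total-devices-over-time series (a cubic trend plus noise);
# the real notebook fits the observed daily totals instead.
days = np.arange(0, 400, dtype=float)
rng = np.random.default_rng(0)
total = 0.001 * days**3 + 5 * days + 100 + rng.normal(0, 50, days.size)

# Fit a degree-3 polynomial and evaluate it back on the same dates:
coeffs = np.polyfit(days, total, deg=3)
fitted = np.polyval(coeffs, days)
print("max abs residual:", np.abs(total - fitted).max())
```

Note a known weakness of this approach, visible in the write-up below: an unconstrained cubic can go negative outside the well-covered range, which is meaningless for a device count.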

Not great, but it does the job... The curve doesn't fit well for values before 2018 (it even goes negative), and the error between the real value and the model's forecast for the most recent date doesn't make me very happy. I quickly tried fitting a polynomial regression (also degree 3) in Google Sheets and it gave me a better model.

google_sheets_regression

If I had to spend more time working on this task:

  1. I'd try to improve and get a more accurate model
  2. I'd estimate the number of devices for future dates (like 2020 and 2021)
  3. I'd incorporate pricing data and redo the Pareto analysis to check how concentrated our revenue is in a few "customers"